Theorem wants to stop AI-written bugs before they ship — and just raised $6M to do it
Jan 27, 2026
As artificial intelligence reshapes software development, a small startup is betting that the industry's next big bottleneck won't be writing code — it will be trusting it.

Theorem, a San Francisco-based company that emerged from Y Combinator's Spring 2025 batch, announced Tuesday it has raised $6 million in seed funding to build automated tools that verify the correctness of AI-generated software. Khosla Ventures led the round, with participation from Y Combinator, e14, SAIF, Halcyon, and angel investors including Blake Borgesson, co-founder of Recursion Pharmaceuticals, and Arthur Breitman, co-founder of blockchain platform Tezos.

The investment arrives at a pivotal moment. AI coding assistants from companies like GitHub, Amazon, and Google now generate billions of lines of code annually. Enterprise adoption is accelerating. But the ability to verify that AI-written software actually works as intended has not kept pace — creating what Theorem's founders describe as a widening "oversight gap" that threatens critical infrastructure from financial systems to power grids.

"We're already there," said Jason Gross, Theorem's co-founder, when asked whether AI-generated code is outpacing human review capacity. "If you asked me to review 60,000 lines of code, I wouldn't know how to do it."

Why AI is writing code faster than humans can verify it

Theorem's core technology combines formal verification — a mathematical technique that proves software behaves exactly as specified — with AI models trained to generate and check proofs automatically. The approach transforms a process that historically required years of PhD-level engineering into something the company claims can be completed in weeks or even days.

Formal verification has existed for decades but remained confined to the most mission-critical applications: avionics systems, nuclear reactor controls, and cryptographic protocols. The technique's prohibitive cost — often requiring eight lines of mathematical proof for every single line of code — made it impractical for mainstream software development.

Gross knows this firsthand. Before founding Theorem, he earned his PhD at MIT working on verified cryptography code that now powers the HTTPS security protocol protecting trillions of internet connections daily.
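To make the idea concrete, here is a minimal, hypothetical illustration of what formal verification looks like in Lean, one of the proof assistants discussed in this piece. The function and theorem names are invented for this example: the specification is a theorem quantified over every possible input, and the proof checker verifies it mechanically — unlike testing, which only samples inputs.

```lean
-- A tiny implementation we want to trust.
def double (n : Nat) : Nat := n + n

-- The specification: `double` agrees with multiplication by two
-- for *every* natural number, not just a sampled test suite.
theorem double_correct (n : Nat) : double n = 2 * n := by
  unfold double
  omega
```

If the implementation were wrong on even one input, the proof would fail to check; that exhaustiveness is what distinguishes a proof from a passing test run.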
That verified-cryptography project, by his estimate, consumed fifteen person-years of labor.

"Nobody prefers to have incorrect code," Gross said. "Software verification has just not been economical before. Proofs used to be written by PhD-level engineers. Now, AI writes all of it."

How formal verification catches the bugs that traditional testing misses

Theorem's system operates on a principle Gross calls "fractional proof decomposition." Rather than exhaustively testing every possible behavior — computationally infeasible for complex software — the technology allocates verification resources in proportion to the importance of each code component.

The approach recently identified a bug that had slipped past testing at Anthropic, the AI safety company behind the Claude chatbot. Gross said the technique helps developers "catch their bugs now without expending a lot of compute."

In a recent technical demonstration called SFBench, Theorem used AI to translate 1,276 problems from Rocq (a formal proof assistant) to Lean (another verification language), then automatically proved each translation equivalent to the original. The company estimates a human team would have needed approximately 2.7 person-years to complete the same work.

"Everyone can run agents in parallel, but we are also able to run them sequentially," Gross explained, noting that Theorem's architecture handles interdependent code — where solutions build on each other across dozens of files — that trips up conventional AI coding agents limited by context windows.

How one company turned a 1,500-page specification into 16,000 lines of trusted code

The startup is already working with customers in AI research labs, electronic design automation, and GPU-accelerated computing. One case study illustrates the technology's practical value.

A customer came to Theorem with a 1,500-page PDF specification and a legacy software implementation plagued by memory leaks, crashes, and other elusive bugs.
Their most urgent problem: improving throughput from 10 megabits per second to 1 gigabit per second — a 100-fold increase — without introducing new errors.

Theorem's system generated 16,000 lines of production code, which the customer deployed without ever manually reviewing it. The confidence came from a compact executable specification — a few hundred lines that distilled the massive PDF document — paired with an equivalence-checking harness that verified the new implementation matched the intended behavior.

"Now they have a production-grade parser operating at 1 Gbps that they can deploy with the confidence that no information is lost during parsing," Gross said.

The security risks lurking in AI-generated software for critical infrastructure

The funding announcement arrives as policymakers and technologists increasingly scrutinize the reliability of AI systems embedded in critical infrastructure. Software already controls financial markets, medical devices, transportation networks, and electrical grids. AI is accelerating how quickly that software evolves — and how easily subtle bugs can propagate.

Gross frames the challenge in security terms. As AI makes it cheaper to find and exploit vulnerabilities, defenders need what he calls "asymmetric defense" — protection that scales without proportional increases in resources.

"Software security is a delicate offense-defense balance," he said. "With AI hacking, the cost of hacking a system is falling sharply. The only viable solution is asymmetric defense.
If we want a software security solution that can last for more than a few generations of model improvements, it will be via verification."

Asked whether regulators should mandate formal verification for AI-generated code in critical systems, Gross offered a pointed response: "Now that formal verification is cheap enough, it might be considered gross negligence to not use it for guarantees about critical systems."

What separates Theorem from other AI code verification startups

Theorem enters a market where numerous startups and research labs are exploring the intersection of AI and formal verification. The company's differentiation, Gross argues, lies in its singular focus on scaling software oversight rather than applying verification to mathematics or other domains.

"Our tools are useful for systems engineering teams, working close to the metal, who need correctness guarantees before merging changes," he said.

The founding team reflects that technical orientation. Gross brings deep expertise in programming language theory and a track record of deploying verified code into production at scale. Co-founder Rajashree Agrawal, a machine learning research engineer, focuses on training the AI models that power the verification pipeline.

"We're working on formal program reasoning so that everyone can oversee not just the work of an average software-engineer-level AI, but really harness the capabilities of a Linus Torvalds-level AI," Agrawal said, referencing the legendary creator of Linux.

The race to verify AI code before it controls everything

Theorem plans to use the funding to expand its team, increase compute resources for training verification models, and push into new industries including robotics, renewable energy, cryptocurrency, and drug synthesis. The company currently employs four people.

The startup's emergence signals a shift in how enterprise technology leaders may need to evaluate AI coding tools.
The first wave of AI-assisted development promised productivity gains — more code, faster. Theorem is wagering that the next wave will demand something different: mathematical proof that speed doesn't come at the cost of safety.

Gross frames the stakes in stark terms. AI systems are improving exponentially. If that trajectory holds, he believes, superhuman software engineering is inevitable — AI capable of designing systems more complex than anything humans have ever built.

"And without a radically different economics of oversight," he said, "we will end up deploying systems we don't control."

The machines are writing the code. Now someone has to check their work.